STAT331 Final Project Report

Author

Thien An Tran, Tejasree Kandibanda, Matthew Huang, Chloe Anbarcioglu

1 Introduction

For our project, we want to explore the relationship between murder rate and happiness score in each country per year. Specifically, we will examine how variations in the murder rate per 100k people (explanatory variable) might influence changes in the happiness score on a 0 to 100 scale (response variable) across different countries over time.

While many factors could affect a country’s happiness score, such as economic status or weather conditions, delving into this relationship would reveal a significant factor that shapes a country’s overall happiness. If there is a correlation between murder and happiness score, it opens up avenues for developing preventative strategies which would help foster safer and happier countries.

We hypothesized that there would be a negative association between average happiness scores and murder rates. It is plausible that higher murder rates would coincide with a worse perception of security, resulting in lower happiness rates. There could also be factors such as government instability or even organized crime, which may lower happiness as well as increase murder rates.

1.1 Data Cleaning

We obtained our data from Gapminder (n.d.). The datasets include:

Population: Contains total population counts for each country per year. There are 197 rows/observations (a country) and 302 columns/variables (1800- 2100).

Murders (total deaths): Contains total number of estimated deaths from interpersonal violence for each country per year. There are 204 rows/observations (a country) and 31 columns/variables (1990-2019).

Happiness Score (WHR): Contains happiness score for each country per year, representing the national average response to the question, “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. Gapminder has converted it to a 0 to 100 scale so it’s in terms of percentage. There are 163 rows/observations (a country) and 19 columns/variables (2005-2022).

In order to find the relationship between murder rates and happiness score for each country per year, we need to merge the total murders data set and population data set. That way, we can get the murder rate per 100K people. We would also need to convert both datasets into a long format so each row is a unique observation consisting of a country and a year.

Note

We first have to convert all the values into numbers (i.e. fix cases such as 1.1k to be 1100).

Code
convert_value <- function(val) {
  val <- as.character(val)
  
  multiplier <- case_when(
    str_detect(val, "k") ~ 1e3,
    str_detect(val, "M") ~ 1e6,
    str_detect(val, "B") ~ 1e9,
    TRUE ~ 1
  )
  
  numeric_value <- as.numeric(str_remove_all(val, "[kMB]"))
  
  return(numeric_value * multiplier)
}

murder_clean <- murder |>
  select(country, `2005`:`2019`) |> 
  pivot_longer(cols = `2005`:`2019`,
               names_to = "year",
               values_to = "murder_count") |> 
  mutate(across(murder_count, ~convert_value(.)))

population_clean <- population |>
  select(country, `2005`:`2019`) |>
  pivot_longer(cols = `2005`:`2019`,
               names_to = "year",
               values_to = "population") |>
  mutate(across(population, ~convert_value(.)))

After cleaning the total murders and population data set, we can merge them to get a data set of the murder rate per 100k people for each country and year. We can then use pivot longer to transform the happiness score data set into a long format (so each row is a unique country and year) and merge it with the murder rate per 100k data set to get our final data set.

Code
happiness_clean <- happiness |>
  select(country, `2005`:`2019`) |>
  pivot_longer(cols = `2005`:`2019`,
               names_to = "year",
               values_to = "happiness_score") |>
   drop_na(happiness_score)

murder_happiness <- murder_clean |>
  inner_join(population_clean, by = c("country", "year")) |>
  mutate(murder_rate_per_100k = (murder_count / population) * 100000) |>
  inner_join(happiness_clean, by = c("country", "year"))

We made our final dataset to be between 2005 and 2019 as they were the common years among the datasets. The final data set contains 1,820 rows and 6 columns. The columns are country, year, murder_count, population, murder_rate_per_100k, and happiness_score.

It provides a comprehensive overview of the murder rates and happiness scores across various countries and years. Each entry in the data set corresponds to a unique combination of one out of 162 countries and a year ranging from 2005 to 2019.

2 Linear Regression

We will use linear regression to model the relationship between our two quantitative variables, murder rate (log) and happiness score. We want to see if our hypothesis from above holds true and whether there is a negative association between the two variables.

We opted to take the logarithm of murder rate because we observed a skewed distribution where countries with smaller murder rates occurred more often than those with larger murder rates. This transformation improved linearity, stabilized variance, and enhanced the overall model fit.

2.1 Data Visualization

These next two data visualizations explore the relationship between our two quantitative variables. Using these visualizations, we aim to gain a deeper understanding of the relationship between the variables and uncover any interesting patterns or trends that may emerge over time and across countries.

Code
animated_plot <- ggplot(murder_happiness,
                        aes(x = log(murder_rate_per_100k),
                            y = happiness_score)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Relationship Between Murder Rate and Happiness Score (2005-2019)",
       subtitle = "Happiness Score (0-100)",
       x = "Log-Scaled Murder Rate (per 100k)",
       y = "",
       caption = "Year: {frame_time}") +
  transition_time(as.integer(year)) +
  enter_fade() +
  exit_fade() +
  theme_bw() +
  theme(plot.caption = element_text(size = 11))

animate(animated_plot, renderer = gifski_renderer())

From 2005 to 2019, there is a consistent negative association between murder rate (log) and happiness score. The countries with higher happiness scores tend to have lower murder rates (log), while those with lower happiness scores tend to have higher murder rates (log). However, the strength of the negative association varies throughout the years going between moderate and weak. One unique observation is that some years have wider ranges of murder rates (log) and happiness scores than other years. An example is around 2012-2015, there are some countries with really high murder murder rates.

Code
country_murder_happiness <- murder_happiness |>
  group_by(country) |>
  summarise(avg_murder_rate = mean(murder_rate_per_100k),
            avg_happiness_score = mean(happiness_score))

country_murder_happiness |>
  ggplot(aes(x = log(avg_murder_rate), 
             y = avg_happiness_score)
         ) +
    geom_point(color = "steelblue") +
    geom_smooth(method = "lm", color = "black") +
    labs(title = "Relationship Between Average Murder Rate and Happiness Score (2005-2019)",
         subtitle = "Average Happiness Score (0-100)",
         x = "Log-Scaled Average Murder Rate (per 100k)", 
         y = "") +
    theme_bw()

For this graph, there seems to be a moderate, negative association between the overall average murder rate and happiness score. The dispersion of points around the line suggests that while murder rate may have an impact on happiness scores, it is likely not the sole determining factor. For different average murder rates (log), there is a wide range of both high and low average happiness scores. Because of this, there does not appear to be any noticeable extreme outliers or unusual observations, but there is some spread in the data, especially in the range of lower average murder rates.

2.2 Modeling

We will be using linear regression as a statistical method to model the relationship between murder rate and happiness score. We can then use this method to evaluate the model fit.

Note

Linear regression is appropriate because the LINE conditions are met. There is linearity between the explanatory and response variables, residuals are independent of each other, residuals follow a fairly normal distribution, and the variance of residuals are consistent across all levels of the explanatory variable.

Code
linear_model <- lm(avg_happiness_score ~ log(avg_murder_rate), country_murder_happiness)
broom::tidy(linear_model) |>
  mutate(
    estimate = round(estimate, 2),
    std.error = round(std.error, 2),
    statistic = round(statistic, 2),
  ) |>
  knitr::kable(digits = 5)
term estimate std.error statistic p.value
(Intercept) 57.77 1.27 45.65 0.00000
log(avg_murder_rate) -2.81 0.71 -3.94 0.00012

\[\begin{equation*} \text{Predicted Average Happiness Score} = 57.77 - 2.81 \times \ln(\text{Average Murder Rate}) \end{equation*}\]

The linear regression model suggests that for each unit increase in the natural logarithm of the average murder rate, the predicted average happiness score decreases by approximately -2.81 points. Also, when the natural logarithm of the average murder rate is 0, the predicted average happiness score is 57.77 points.

2.3 Model Fit

Code
var_table <- data.frame(
  Variable = c("Response Variable Variance", "Fitted Values Variance", "Residuals Variance", "Explained Variation (R^2)"),
  Value = c(
    var(country_murder_happiness$avg_happiness_score),
    var(linear_model$fitted.values),
    var(linear_model$residuals),
    var(linear_model$fitted.values) / var(country_murder_happiness$avg_happiness_score)
  )
)

var_table |>
  mutate(
    Value = round(Value, 2)
  ) |>
  knitr::kable(digits = 4)
Variable Value
Response Variable Variance 114.73
Fitted Values Variance 10.14
Residuals Variance 104.59
Explained Variation (R^2) 0.09

The variation of the actual happiness scores is 114.73, while the variance of the happiness scores predicted by our model is 10.14. There is a large difference between these values, meaning that our model does not account for much of the change in the actual data.

By taking the ratio, we find that the explained variation is 0.09. This means that murder rate explains about 9 percent of the variability in happiness in our model.

3 Simulation

Ou goal is to evaluate the performance of our linear regression model by comparing the observed data to simulated data generated using the model’s predictions. By simulating data in this manner, we can assess how well our model captures the underlying relationship between average murder rate and happiness.

3.1 Visualizing Simulated Data

The comparison between the observed and simulated data will be visualized through side-by-side plots showcasing the relationships modeled by the linear regression for both data sets.

Code
set.seed(42)

predictions <- predict(linear_model, country_murder_happiness)
residual_se <- sigma(linear_model)
simulated_y <- predictions + rnorm(n = length(predictions), mean = 0, sd = sigma(linear_model))

observed <- ggplot(country_murder_happiness, 
             aes(x = log(avg_murder_rate), 
                 y = avg_happiness_score)
             ) +
  geom_point(color = "steelblue") +
  labs(title = "Observed Data",
       subtitle = "Observed Happiness Score (0-100)",
       x = "Log-Scaled Average Murder Rate (per 100k)", 
       y = "") +
  theme_bw()

# Plot Simulated Data
predicted <- ggplot(country_murder_happiness, 
             aes(x = log(avg_murder_rate), 
                 y = simulated_y)
             ) +
  geom_point(color = "orange3") +
  labs(title = "Simulated Data",
       subtitle = "Simulated Happiness Score (0-100)",
       x = "Log-Scaled Average Murder Rate (per 100k)", 
       y = "") +
  theme_bw()

observed + predicted

Overall, the simulated data looks very similar to the observed data, but there are some differences. The simulated data is more densely gathered around the predicted values, suggesting deviations from a perfectly normal distribution within the observed data. The predicted data is also noticeably more linear, suggesting the observed data is not entirely linear.

These differences have negative implications on the appropriateness of modeling the observed data through linear regression. Despite this, we believe the conditions hold strongly enough that the linear model is relevant.

3.2 Full Scale Simulation

We plan to generate at least 1000 simulated data sets, representing what we would expect to observe if the regression model accurately captured the relationship between average murder rates and average happiness scores. By plotting the R-squared values, we will be able to discuss the implications of the values in the distribution.

Code
set.seed(42)

r_squared_values <- map_dbl(1:1000, ~ {
  simulated_y <- predictions + rnorm(n = length(predictions), mean = 0, sd = residual_se)
  simulated_dataset <- data.frame(avg_murder_rate = log(country_murder_happiness$avg_murder_rate), avg_happiness_score = simulated_y)
  simulated_dataset <- na.omit(simulated_dataset)
  summary(lm(avg_happiness_score ~ avg_murder_rate, data = simulated_dataset))$r.squared
})

ggplot(data.frame(R_squared = r_squared_values), 
       aes(x = R_squared)
       ) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = expression("Distribution of"~ R^2 ~"Values"),
       x = expression("Simulated"~ R^2),
       y = "",
       subtitle = "Number of Simulated Models")

The histogram exhibits a right-skewed distribution, with a peak centered around 0.08. A right skewed distribution is expected as the left direction is bounded by 0, while the right is unbounded. The majority of the values seem to fall within the range of 0.03 to 0.14. This suggests that, under similar conditions where the same conditions hold, our model would account for between 3% and 14% of the variability in happiness scores in most cases.